Imagine you are the architect of a grand labyrinth. To find the shortest path from every single room to the exit, you don't need to walk them all. If you have a perfect map (the MDP dynamics), you can simply ask every room's neighbors how far they are from the goal and add your own travel cost. This is the essence of Dynamic Programming (DP).
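The labyrinth analogy can be sketched in a few lines of Python. This is a toy illustration, not any standard library API: the rooms, corridors, and unit step cost are all invented, and each room repeatedly asks its neighbors how far they are from the exit and adds its own travel cost.

```python
# A tiny made-up labyrinth: each room lists the rooms it connects to.
rooms = {"A": ["B"], "B": ["A", "C"], "C": ["B", "EXIT"], "EXIT": []}

# Distance-to-exit estimates; the exit is 0 steps from itself.
dist = {room: float("inf") for room in rooms}
dist["EXIT"] = 0.0

# Sweep until no estimate changes: the "ask your neighbors" update.
changed = True
while changed:
    changed = False
    for room, neighbors in rooms.items():
        if room == "EXIT":
            continue
        best = min(1.0 + dist[n] for n in neighbors)  # unit cost per move
        if best < dist[room]:
            dist[room] = best
            changed = True

print(dist)  # {'A': 3.0, 'B': 2.0, 'C': 1.0, 'EXIT': 0.0}
```

No room is ever "walked" end to end; the shortest distances emerge purely from local updates against the map, which is exactly the move DP makes with a known MDP model.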
The Theoretical Pillars
- Full Backups: Unlike sampling methods that roll the dice on a single trajectory, DP performs a full backup. It looks at every possible successor state $s'$ and every possible reward $r$, weighting them by the transition probability $p(s', r | s, a)$.
- Bootstrapping: DP is built on a leap of faith. We update our estimate of a state's value $V(s)$ using the current estimates of its successors' values $V(s')$. We don't wait for complete returns; we refine our guesses based on other guesses.
- Optimal Fixed Points: For a discounted finite MDP, the Bellman optimality equation has a unique solution $v_*$: the Bellman optimality operator is a contraction, so it admits exactly one fixed point. Iterative DP is the mathematical engine that drives our current value function toward it.
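All three pillars can be seen at once in value iteration. The sketch below uses an invented two-state MDP (the state names, actions, transition probabilities, and rewards are illustrative assumptions, not from the text): each sweep does a full backup over $p(s', r \mid s, a)$, bootstraps on the current $V(s')$, and stops when the estimates settle at the fixed point.

```python
import itertools

# A made-up two-state MDP for illustration.
# P[s][a] = list of (probability, next_state, reward) outcomes.
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in P}
for sweep in itertools.count():
    delta = 0.0
    for s in P:
        # Full backup: an expectation over every (s', r) pair,
        # weighted by p(s', r | s, a) -- no sampling.
        # Bootstrapping: the target uses the current estimates V[s'].
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:  # estimates stopped moving: we are at the fixed point
        break

print({s: round(v, 3) for s, v in V.items()})
```

Because the Bellman optimality operator is a contraction (for $\gamma < 1$), each sweep shrinks the distance to $v_*$, so the loop is guaranteed to terminate at the unique fixed point regardless of how $V$ was initialized.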